Case study: queue analysis

This case study illustrates the use of the staircase package for queue analysis. In this example we have a number of vessels (i.e. ships) which arrive offshore and await their turn to enter a harbour where they will be loaded with cargo. We will examine the queue, which is composed of all vessels which are offshore but yet to enter the harbour, for the year 2020.

The data used in this case study is synthetic and fictional. Both the data and the notebook for this tutorial can be obtained from the GitHub site.

[1]:
import pandas as pd
import staircase as sc
import matplotlib.pyplot as plt

We begin by importing the queue data into a pandas.DataFrame instance. Each row corresponds to a vessel. The first column gives the time at which the vessel arrives offshore (enters the queue), and the second column gives the time at which the vessel enters the harbour (leaves the queue). A NaT value in the first column indicates that the vessel entered the queue before 2020, and a NaT value in the second column indicates that it left the queue after 2020. The approach does not require these values to be NaT, however. The third column gives the weight of cargo, in tonnes, destined for the vessel. Note that for the staircase approach to work, every vessel that was in the queue at some point in 2020 must appear in the dataframe.

[2]:
data = pd.read_csv(r"data/vessel_queue.csv", parse_dates=['enter', 'leave'], dayfirst=True)
data
[2]:
      enter                leave                tonnes
0     NaT                  2020-01-01 04:40:00  129000
1     NaT                  2020-01-01 22:18:00  69055
2     NaT                  2020-01-01 11:47:00  138000
3     NaT                  2020-01-02 10:12:00  84600
4     NaT                  2020-01-01 22:39:00  142550
...   ...                  ...                  ...
1224  2020-12-29 05:59:00  NaT                  142500
1225  2020-12-29 17:41:00  NaT                  84600
1226  2020-12-30 12:41:00  NaT                  119200
1227  2020-12-30 16:59:00  NaT                  113200
1228  2020-12-30 17:59:00  NaT                  142500

1229 rows × 3 columns

The layer method can be used with array-like parameters. The creation of a step function to quantify the size of the queue is as simple as calling the layer method with a vector of times that vessels enter the queue, and a vector of times that vessels leave the queue - the columns “enter” and “leave” respectively:

[3]:
queue = sc.Stairs(use_dates=True).layer(data.enter, data.leave)
queue.plot()
[3]:
<AxesSubplot:>
../_images/examples_Case_Study_Queue_Analysis_5_2.png

Assuming that no vessels arrive precisely at midnight on the 1st of Jan, we expect the number of vessels in the queue at this time to be equal to the number of NaT values in the “enter” column:

[4]:
queue(pd.Timestamp('2020-01-01'))
[4]:
10
[5]:
data.enter.isna().sum()
[5]:
10
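To make the idea behind layering concrete, here is a minimal plain-Python sketch. It does not use staircase and is not how the package is implemented. The toy data, the helper names build_step_function and value_at, and the use of None in place of NaT are all assumptions made for illustration. Each vessel adds 1 to the step function when it enters and subtracts 1 when it leaves, and a vessel with no known entry time counts towards the initial value.

```python
from datetime import datetime

# Hypothetical toy data: (enter, leave) per vessel; None plays the role of NaT.
vessels = [
    (None, datetime(2020, 1, 1, 4, 40)),
    (None, datetime(2020, 1, 2, 10, 12)),
    (datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 3, 6, 0)),
    (datetime(2020, 1, 2, 12, 0), None),
]

def build_step_function(intervals):
    """Return (initial_value, sorted list of (time, change)) for the intervals."""
    initial = 0
    changes = {}
    for enter, leave in intervals:
        if enter is None:
            initial += 1  # vessel was already queued before the first known time
        else:
            changes[enter] = changes.get(enter, 0) + 1
        if leave is not None:
            changes[leave] = changes.get(leave, 0) - 1
    return initial, sorted(changes.items())

def value_at(step_function, when):
    """Evaluate the step function, counting changes that occur at or before `when`."""
    initial, changes = step_function
    return initial + sum(delta for time, delta in changes if time <= when)

toy_queue = build_step_function(vessels)
print(value_at(toy_queue, datetime(2020, 1, 1)))        # 2
print(value_at(toy_queue, datetime(2020, 1, 3, 7, 0)))  # 1
```

In this sketch the value at the start of the year equals the number of vessels with no entry time, which mirrors the check above. Replacing the increments of 1 with a per-vessel weight would give a weighted step function instead of a count.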

Another useful queue metric is the “queue tonnes”. This is the sum of the cargo tonnes destined for vessels in the queue. A step function representing this variable is also straightforward to create using the third parameter of the layer method. This parameter gives the values by which the step function should increase or decrease whenever the corresponding vessels enter or leave the queue:

[6]:
queue_tonnes = sc.Stairs(use_dates=True).layer(data.enter, data.leave, data.tonnes)
queue_tonnes.plot()
[6]:
<AxesSubplot:>
../_images/examples_Case_Study_Queue_Analysis_10_1.png

We can use this queue_tonnes object to answer questions like “what was the maximum queue tonnes in 2020?” and “what was the average queue tonnes over the year?”

[7]:
queue_tonnes.max()
[7]:
2138466
[8]:
queue_tonnes.mean(pd.Timestamp('2020'), pd.Timestamp('2021'))
[8]:
817852.3209585852

or “for what fraction of the year was queue_tonnes larger than 1,500,000 tonnes?”

[9]:
(queue_tonnes > 1500000).mean(pd.Timestamp('2020'), pd.Timestamp('2021'))
[9]:
0.09867751973285666
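The mean of a step function over a window is a time-weighted average, and the fraction of time above a threshold is the mean of a 0/1 indicator function. The following plain-Python sketch illustrates both on a small hypothetical step function. The segment data and the helper names mean_over and indicator_above are assumptions for illustration only and are not part of staircase.

```python
# Hypothetical step function on [0, 10) hours, stored as (start, end, value) segments.
segments = [(0, 2, 100), (2, 5, 400), (5, 9, 250), (9, 10, 0)]

def mean_over(segments, lower, upper):
    """Time-weighted mean of a piecewise-constant function on [lower, upper)."""
    total = 0.0
    for start, end, value in segments:
        # Length of the part of this segment that lies inside the window.
        overlap = max(0, min(end, upper) - max(start, lower))
        total += value * overlap
    return total / (upper - lower)

def indicator_above(segments, threshold):
    """Step function equal to 1 where the value exceeds the threshold, else 0."""
    return [(start, end, 1 if value > threshold else 0)
            for start, end, value in segments]

print(mean_over(segments, 0, 10))                        # 240.0
print(mean_over(indicator_above(segments, 200), 0, 10))  # 0.7
```

Here the function exceeds 200 for 7 of the 10 hours, so the mean of the indicator is 0.7.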
[10]:
queue_tonnes.median(pd.Timestamp('2020-3-1'), pd.Timestamp('2020-4-1'))
[10]:
1281300.0

The median gives us the 50th percentile, but we might also be interested in the 80th percentile. We can calculate that too:

[11]:
queue_tonnes.percentile(80, pd.Timestamp('2020-3-1'), pd.Timestamp('2020-4-1'))
[11]:
1420500.0

In fact we can even get a percentile function, represented by a Stairs object itself. This function is essentially the inverse of an empirical cumulative distribution function.

[12]:
inv_ecdf = queue_tonnes.percentile_Stairs(pd.Timestamp('2020'), pd.Timestamp('2021'))
inv_ecdf
/home/docs/checkouts/readthedocs.org/user_builds/railing/envs/v1.1.0/lib/python3.7/site-packages/staircase/stairs.py:1448: PendingDeprecationWarning: Stairs.percentile_Stairs will be deprecated in version 2.0.0, use Stairs.percentile_stairs instead
  PendingDeprecationWarning
[12]:
<staircase.Stairs, id=139856727768312, dates=False>

We can plot this function of course, since it is represented by a Stairs object:

[13]:
inv_ecdf.plot()
[13]:
<AxesSubplot:>
../_images/examples_Case_Study_Queue_Analysis_24_1.png

The 100th percentile should be the same as the maximum queue tonnes we found earlier. Let’s check:

[14]:
inv_ecdf(100) == queue_tonnes.max()
[14]:
True

What are the 40th, 65th, 77th and 90th percentiles? The sample method, which is aliased by __call__, can also be called with a vector of values at which to evaluate the step function.

[15]:
inv_ecdf([40, 65, 77, 90])
[15]:
[659500, 1035400, 1220600, 1496800]
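A percentile of a step function can be thought of as a duration-weighted percentile of its values. The sketch below is a plain-Python illustration under that assumption: it sorts the segments by value and accumulates their durations until the requested share of time is reached. The segment data and the percentile helper are hypothetical, and staircase may treat exact boundary cases differently.

```python
# Hypothetical step function stored as (start, end, value) segments.
segments = [(0, 2, 100), (2, 5, 400), (5, 9, 250), (9, 10, 0)]

def percentile(segments, p):
    """Smallest value v such that the function is <= v for at least p% of the time."""
    total = sum(end - start for start, end, _ in segments)
    elapsed = 0
    for start, end, value in sorted(segments, key=lambda s: s[2]):
        elapsed += end - start
        if elapsed * 100 >= p * total:
            return value
    raise ValueError("p must be between 0 and 100")

print(percentile(segments, 50))                           # 250
print(percentile(segments, 100))                          # 400
print([percentile(segments, p) for p in (40, 65, 77, 90)])  # [250, 250, 400, 400]
```

As in the check above, the 100th percentile in this sketch coincides with the maximum value of the function.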

Returning to our queue plots: they’re pretty noisy. Perhaps a daily average will suffice. To achieve this let’s use Python’s zip function, a list comprehension and a pandas.Series to derive and collect this data:

[16]:
yr2020 = pd.date_range('2020', '2021')
daily_mean_queue = pd.Series(
    [queue.mean(d1,d2) for d1,d2 in zip(yr2020[:-1], yr2020[1:])],
    index = yr2020[:-1]
)
daily_mean_queue
[16]:
2020-01-01     9.673611
2020-01-02    10.056944
2020-01-03     7.490278
2020-01-04     5.312500
2020-01-05     5.288194
                ...
2020-12-27    15.424306
2020-12-28    14.813194
2020-12-29    16.374306
2020-12-30    15.004861
2020-12-31    16.000000
Freq: D, Length: 366, dtype: float64
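The pairing of consecutive day boundaries with zip can be seen in isolation in the following standard-library sketch. The names boundaries and day_windows are hypothetical. There are 367 midnight boundaries from 1 January 2020 to 1 January 2021 inclusive, which pair up into 366 daily windows because 2020 is a leap year.

```python
from datetime import datetime, timedelta

# Midnight boundaries from 2020-01-01 to 2021-01-01 inclusive.
start = datetime(2020, 1, 1)
boundaries = [start + timedelta(days=i) for i in range(367)]

# Pair each boundary with the next one to form (day_start, day_end) windows.
day_windows = list(zip(boundaries[:-1], boundaries[1:]))

print(len(day_windows))  # 366
print(day_windows[0])    # (datetime.datetime(2020, 1, 1, 0, 0), datetime.datetime(2020, 1, 2, 0, 0))
```

Each window can then be passed as the lower and upper bounds of a mean calculation, one day at a time.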

We can call the pandas.Series.plot method for a quick visualisation.

[17]:
daily_mean_queue.plot()
[17]:
<AxesSubplot:>
../_images/examples_Case_Study_Queue_Analysis_32_1.png

Since the data is now in a series it’s easy to apply a rolling window. This data can be plotted with matplotlib, or seaborn, but for now let’s keep leveraging the pandas.Series plotting methods:

[18]:
fig, ax = plt.subplots(figsize=(20,5))
daily_mean_queue.plot(ax=ax, label="queue size")
daily_mean_queue.rolling(7, center=True).mean().plot(ax=ax, linewidth=3, label="rolling mean")
ax.legend()
[18]:
<matplotlib.legend.Legend at 0x7f32ee571940>
../_images/examples_Case_Study_Queue_Analysis_34_2.png
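For readers unfamiliar with centred rolling windows, the following plain-Python sketch shows what a 7-day centred mean computes. The helper centered_rolling_mean is hypothetical, assumes an odd window length, and uses None where pandas would produce NaN. With the default settings, the first three and last three days have no value because a full window is not available there.

```python
def centered_rolling_mean(values, window):
    """Mean over a window centred on each position; None where the window is incomplete."""
    half = window // 2  # assumes an odd window length
    result = []
    for i in range(len(values)):
        lo, hi = i - half, i + half + 1
        if lo < 0 or hi > len(values):
            result.append(None)  # not enough observations on one side
        else:
            result.append(sum(values[lo:hi]) / window)
    return result

print(centered_rolling_mean([1, 2, 3, 4, 5, 6, 7, 8, 9], 7))
# [None, None, None, 4.0, 5.0, 6.0, None, None, None]
```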